This week, we introduce the core idea of teaching a computer to learn concepts using data—without being explicitly programmed.
We are going to start by covering linear regression with one variable. Linear regression predicts a real-valued output based on an input value. We discuss the application of linear regression to housing price prediction, present the notion of a cost function, and introduce the gradient descent method for learning.
Linear Regression with One Variable - Predicting a single output value from a single input value.
Hypothesis Function: $$h_\theta(x) = \theta_0 + \theta_1 x$$
We give $h_\theta$ values for $\theta_0$ and $\theta_1$ to get our output 'y'. In other words, we are trying to create a function $h_\theta$ that can reliably map our input data (the x's) to our output data (the y's).
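As a tiny sketch (the function name and numbers here are mine, not the course's), the hypothesis is just a straight line in code:

def h(theta0, theta1, x):
    """Univariate hypothesis: h_theta(x) = theta_0 + theta_1 * x."""
    return theta0 + theta1 * x

print(h(2.0, 0.5, 4))   # 2.0 + 0.5 * 4 = 4.0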
Cost Function: $$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2$$
To break it apart, it is $\frac{1}{2} \bar{x}$ where $\bar{x}$ is the mean of the squares of $h_\theta (x^{(i)}) - y^{(i)}$, or the difference between the predicted value and the actual value.
Hypothesis function accuracy is measured by the Cost Function
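A minimal numpy sketch of this cost (my own helper; the four data points are the same toy (x, y) pairs used in the fitting code further down):

import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta_0, theta_1), including the 1/(2m) factor."""
    m = len(x)
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)

x = np.array([1, 2, 3, 4])
y = np.array([6, 5, 7, 10])
print(cost(3.5, 1.4, x, y))   # cost for one candidate pair of parameters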
Gradient Descent: repeatedly stepping in the direction of the negative derivative of our Cost Function to minimize its output
$\alpha$, the learning rate
Gradient descent equation:
repeat until convergence:
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ for j=0 and j=1, updating both simultaneously
Or intuitively,
repeat until convergence:
$\theta_j := \theta_j - \alpha [\text{Slope of tangent aka derivative}]$
The point of all this is that if we start with a guess for our hypothesis and then repeatedly
apply these gradient descent equations, our hypothesis will become more and more accurate.
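A hedged sketch of that loop for the univariate case (the two gradients below are the standard partial derivatives of the squared-error cost; names and data are mine):

import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=5000):
    """Repeatedly apply the simultaneous update for theta_0 and theta_1."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y      # h_theta(x^(i)) - y^(i) for all i
        grad0 = np.sum(error) / m              # dJ/dtheta_0
        grad1 = np.sum(error * x) / m          # dJ/dtheta_1
        theta0 -= alpha * grad0                # update both at the same time
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1, 2, 3, 4])
y = np.array([6, 5, 7, 10])
print(gradient_descent(x, y))   # approaches (3.5, 1.4), the least-squares fit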
Classification & Regression
Model Representation:
Note: indices in tables and matrices are 1-based, not 0-based
m = number of training examples
x's = "input" variable/features
y's = "output" variable/"target" variable
(x,y) : single training example (a row in a table)
$(x^{(i)},y^{(i)})$ : $i^{th}$ training example
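As a small illustration of this notation (the numbers are illustrative placeholders, not the course's actual housing data):

# Training set as a list of (x, y) rows; the notes index these 1-based.
training_set = [(2104, 460), (1416, 232), (1534, 315), (852, 178)]

m = len(training_set)          # m = 4 training examples
x_2, y_2 = training_set[1]     # (x^(2), y^(2)); Python itself is 0-based
print(m, x_2, y_2)             # 4 1416 232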
In [2]:
from notebook_utils import *
IMG('univariate-linear-regression.png')
Out[2]:
Given training examples (x,y), choose $\theta_0$ and $\theta_1$ so that $h_\theta(x)$ is close to y for our training data
In [12]:
IMG('cost-function.png', r=True)
Out[12]:
Below: "Find the values of $\theta_0$ & $\theta_1$ so that the average $\frac12m$ * the $\Sigma$ of squared errors is minimized. (Khan Academy - Squared error of regression)
In [11]:
IMG('minimize.png')
Out[11]:
Cost Function: $J(\theta_0,\theta_1)$
In [10]:
IMG('derivative.png')
Out[10]:
Gradient Descent + Mean Squared Error = our first Learning Algorithm (Linear Regression)
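For the squared-error cost, working out the two partial derivatives turns the generic update rule into these concrete, simultaneous updates:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$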
In [14]:
IMG('gradient-descent.png')
Out[14]:
In [17]:
IMG('gradient-descent-algorithm.png',r=True)
Out[17]:
In [18]:
IMG('linear-regression-simple.png')
Out[18]:
In [19]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

m = 4  # number of training examples
n = 2  # columns in the raw data: (x, y)

# Toy training set: each row is one (x, y) example
data = np.array([
    [1, 6],
    [2, 5],
    [3, 7],
    [4, 10]
])

# Design matrix X: a column of ones (for theta_0) next to the x values
X = np.matrix([np.ones(m), data[:, 0]]).T
y = np.matrix(data[:, 1]).T

# Normal equation: theta = (X^T X)^-1 X^T y
betaHat = (X.T * X).I * X.T * y
print(betaHat)  # intercept (theta_0) and slope (theta_1)

plt.figure(1)
xx = np.linspace(0, 5, 2)
yy = np.array(betaHat[0] + betaHat[1] * xx)       # fitted line h_theta(x)
plt.plot(xx, yy.T, color='b')
plt.scatter(data[:, 0], data[:, 1], color='r')    # training data
plt.show()
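As a sanity check on the normal-equation result above (not part of the original notebook), numpy's built-in least-squares solver gives the same coefficients:

import numpy as np

data = np.array([[1, 6], [2, 5], [3, 7], [4, 10]])
X = np.column_stack([np.ones(len(data)), data[:, 0]])
theta, *_ = np.linalg.lstsq(X, data[:, 1], rcond=None)
print(theta)   # approximately [3.5, 1.4], the same intercept and slope as betaHat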
Matrix: rectangular array of numbers
4 x 2 ( $\mathbb{R}^{4\times2}$ ) and 2 x 3 ( $\mathbb{R}^{2\times3}$ ) matrices below:
$\left[ \begin{array}{cc} 1402 & 191 \\ 1371 & 821 \\ 949 & 1437 \\ 147 & 1448 \end{array} \right] \left[ \begin{array}{ccc} 1 & 2 & 3 \\ 4 & 5 & 6 \end{array} \right] $
Vector: an n x 1 matrix, e.g. $\mathbb{R}^4$ below:
$ \left[ \begin{array}{c} 460 \\ 232 \\ 315 \\ 178 \end{array} \right] $
Note: assume vectors are 1-indexed
A, B, C, X ← usually matrices
a, b, c, x ← usually a vector or scalar
Can only add matrices of the same dimensions, i.e. $\mathbb{R}^{4\times2} + \mathbb{R}^{4\times2}$
Just add the values at each position
$3 \times \left[ \begin{array}{cc} 1 & 0 \\ 2 & 5 \\ 3 & 1 \\ \end{array} \right] = \left[ \begin{array}{cc} 3 & 0 \\ 6 & 15 \\ 9 & 3 \\ \end{array} \right]$
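A quick numpy check of both rules (the 3 x 2 matrix is the one from the scalar example above; the second matrix for the addition is arbitrary):

import numpy as np

A = np.array([[1, 0], [2, 5], [3, 1]])
B = np.array([[4, 1], [0, 2], [7, 3]])

print(A + B)    # element-wise sum; only defined because both are 3 x 2
print(3 * A)    # [[3, 0], [6, 15], [9, 3]], matching the example above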
Multiplying a matrix and a vector
In [23]:
IMG('matrix-vector-multiplication.png',r=True)
Out[23]:
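In numpy terms (a small example of my own, not the slide's): an m x n matrix times an n-vector gives an m-vector, where each entry is the dot product of one row with the vector.

import numpy as np

A = np.array([[1, 3],
              [4, 0],
              [2, 1]])        # 3 x 2 matrix
v = np.array([1, 5])          # length-2 vector

print(A @ v)                  # [1*1 + 3*5, 4*1 + 0*5, 2*1 + 1*5] = [16, 4, 7]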
Using Octave's vectorized matrix operations to compute linear regression, compared to an iterative loop: a faster algorithm and less code
In [22]:
IMG('octave-vs-iterative.png')
Out[22]:
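The same idea holds in numpy (the house sizes and parameters below are placeholders): one matrix-vector product computes every prediction at once, instead of looping over the examples one by one.

import numpy as np

sizes = np.array([2104, 1416, 1534, 852])     # hypothetical house sizes
theta = np.array([-40.0, 0.25])               # hypothetical theta_0, theta_1

# Vectorized: prepend a column of ones, then one matrix-vector product.
X = np.column_stack([np.ones(len(sizes)), sizes])
print(X @ theta)

# Iterative equivalent: one prediction per pass through the loop.
print([theta[0] + theta[1] * s for s in sizes])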
In [27]:
IMG('matrix-matrix-multiplication.png')
Out[27]:
Split the second matrix into column vectors, do matrix-vector multiplication with each, then combine the results into a new matrix
1·1 + 3·0 + 2·5 = 11
4·1 + 0·0 + 1·5 = 9
1·3 + 3·1 + 2·2 = 10
4·3 + 0·1 + 1·2 = 14
$\left[ \begin{array}{cc} 11 & 10 \\ 9 & 14 \\ \end{array} \right]$
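A numpy check of that arithmetic (the two matrices below are reconstructed from the dot products above, so treat them as my reading of the slide):

import numpy as np

A = np.array([[1, 3, 2],
              [4, 0, 1]])     # 2 x 3
B = np.array([[1, 3],
              [0, 1],
              [5, 2]])        # 3 x 2

print(A @ B)                  # [[11, 10], [9, 14]]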
In [30]:
IMG('matrix-matrix-multi2.png')
Out[30]:
We can now use matrix multiplication to quickly (and efficiently) compute 3 possible hypotheses for 4 different house sizes, giving all 12 predictions at once!
We can also, of course, use libraries (e.g. numpy) to compute these.
See image below:
In [32]:
IMG('matrix-matrix-multiplication-use.png')
Out[32]:
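A hedged sketch of that trick (the sizes and parameters are placeholders, not the slide's values): a 4 x 2 data matrix (a ones column plus four house sizes) times a 2 x 3 matrix whose columns hold three candidate (theta_0, theta_1) pairs gives a 4 x 3 matrix holding all 12 predictions.

import numpy as np

sizes = np.array([2104, 1416, 1534, 852])           # four house sizes
X = np.column_stack([np.ones(len(sizes)), sizes])   # 4 x 2 data matrix

# Each column is one hypothesis: (theta_0, theta_1)
Theta = np.array([[-40.0, 200.0, -150.0],
                  [ 0.25,   0.1,   0.4]])           # 2 x 3

print(X @ Theta)                                    # 4 x 3: all 12 predictions at once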
WATCH OUT:
Matrix multiplication is NOT commutative: $A \times B \ne B \times A$
Matrix multiplication IS associative.
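A two-line numpy check of both claims (the matrices are arbitrary):

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [0, 2]])

print(np.array_equal(A @ B, B @ A))              # False: not commutative
print(np.array_equal((A @ B) @ C, A @ (B @ C)))  # True: associative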
Identity Matrix: 1s down the diagonal of an n x n matrix, 0s everywhere else. Denoted: $I$ or $I_{n\times n}$ $\left[\begin{array}{ccc}1&0&0\\0&1&0\\0&0&1\end{array}\right]$
Only square matrices can have inverses (and not every square matrix does)
If A is an m x m matrix, and if it has an inverse,
$A(A^{-1}) = A^{-1}A = I$
Example:
In [34]:
IMG('inverse-matrices.png')
Out[34]:
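The same check in numpy, with an arbitrary invertible 2 x 2 matrix (np.linalg.inv computes the inverse):

import numpy as np

A = np.array([[3.0, 4.0],
              [2.0, 16.0]])

A_inv = np.linalg.inv(A)
print(A @ A_inv)     # approximately the 2 x 2 identity (up to floating-point error)
print(np.eye(2))     # I_{2x2} for comparison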
In [35]:
IMG('matrix-transpose.png')
Out[35]:
Example problem Quiz2q4:
$u=\left[\begin{array}{c}3\\-5\\4\end{array}\right]$ and $v=\left[\begin{array}{c}1\\2\\5\end{array}\right]$
What is $u^T$ (u Transposed)?
Answer:
$u^T=\left[\begin{array}{ccc}3&-5&4\end{array}\right]$ so,
Given that an m x n matrix times an n x m matrix yields an m x m matrix,
a 1 x 3 matrix multiplied by a 3 x 1 matrix will result in a 1 x 1 matrix (a scalar):
In [36]:
3*1 + -5*2 + 4*5
Out[36]:
13